ABSTRACT



A parallel compiler exploits temporal recursion to reduce the
cost of control code generated in transforming a sequential
nested loop program into a set of parallel processes mapped
to an array of processors. A parallel compiler process
transforms a nested loop program into a set of single loops,
where each single loop is assigned to execute on a processor
element in a parallel processor array. The parallel compiler
obtains a mapping of iterations of the nested loop to
processor elements in the array and a schedule of start times for
initiating execution of the iterations on corresponding
processor elements in the array. Based on this mapping and
iteration schedule, the parallel compiler generates code to
compute iteration coordinates on a processor element for an
iteration of the single loop from iteration coordinates
computed on the same processor element for a previous iteration
of the single loop. The parallel compiler uses this method to
generate code to compute loop indices, memory addresses,
and tests of loop bounds efficiently based on values from a
previous iteration.
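
The idea of computing iteration coordinates from their previous values can be illustrated with a minimal sketch. It is a simplified, single-processor illustration and not the code the compiler described here generates: a two-deep loop nest is assumed to be flattened into one loop over a counter t, and the names Coords, coords_direct and coords_next are hypothetical. The direct form recomputes the indices, a linearly related address, and a loop-bound test from t with a divide and a modulo; the recursive form obtains the same values from the previous iteration with only an add and a compare.

```c
/* Coordinates of one iteration of a flattened two-deep loop nest.
 * All names here are illustrative assumptions, not taken from the patent. */
typedef struct {
    int i;        /* outer loop index */
    int j;        /* inner loop index */
    int addr;     /* i * n2 + j, a quantity linearly related to (i, j) */
    int at_edge;  /* loop-bound test: 1 when j is the last inner index */
} Coords;

/* Direct computation from the loop counter t (needs divide and modulo). */
Coords coords_direct(int t, int n2)
{
    Coords c;
    c.i = t / n2;
    c.j = t % n2;
    c.addr = c.i * n2 + c.j;
    c.at_edge = (c.j == n2 - 1);
    return c;
}

/* Temporal recursion: derive the coordinates of the next iteration from
 * those of the previous one using a small decision tree, adds and compares. */
Coords coords_next(Coords prev, int n2)
{
    Coords c;
    if (prev.at_edge) {          /* previous iteration closed a row */
        c.i = prev.i + 1;
        c.j = 0;
    } else {
        c.i = prev.i;
        c.j = prev.j + 1;
    }
    c.addr = prev.addr + 1;      /* the linear quantity advances by a constant */
    c.at_edge = (c.j == n2 - 1);
    return c;
}
```

Starting from coords_direct(0, n2) and applying coords_next repeatedly yields the same sequence of values as calling coords_direct for each t, assuming n2 is at least 1.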
PROGRAMMATIC METHOD FOR REDUCING COST OF CONTROL IN
PARALLEL PROCESSES

RELATED APPLICATION DATA

This patent application is related to the following
co-pending U.S. patent applications, commonly assigned
and filed concurrently with this application:

U.S. patent application Ser. No. 09/378,298, entitled,
"PROGRAMMATIC SYNTHESIS OF PROCESSOR ELEMENT
ARRAYS" by Robert Schreiber, Bantwal
Ramakrishna Rau, Shail Aditya Gupta, Vinod Kumar
Kathail, and Sadun Anik.

U.S. patent application Ser. No. 09/378,393, entitled,
"PROGRAMMATIC ITERATION SCHEDULING FOR
PARALLEL PROCESSORS" by Robert Schreiber and
Alain Darte.

The above patent applications are hereby incorporated by
reference.

TECHNICAL FIELD

The invention relates to parallel compiler technology, and
specifically relates to compiler methods for reducing control
cost in parallel processes.

BACKGROUND

Parallel compilers are used to transform a computer
program into parallel code that runs on multi-processor
systems. Traditionally, software developers design the
compiler to optimize code for a fixed type of hardware. A
principal objective of the compiler is to organize the
computations in the program so that sets of computational tasks
in the program may be executed concurrently across
multiple processors in the specified hardware architecture.

Parallel compiler technology extends across a broad range
of parallel computer architectures. For example, the
multi-processor architecture may employ shared memory in which
each processor element shares the same memory space, or
distributed memory in which each processor has a local
memory.

One area of compiler and computer architecture research
focuses on optimizing the processing of computer programs
with loop nests. Many computational tasks in software
applications are expressed in the form of a multi-nested loop
with two or more loops on a block of code called the loop
body. The loop body contains a series of program
statements, typically including operations on arrays whose
elements are indexed by loop indices. Such loop nests are
often written in a high level programming language code in
which the iterations are ordered sequentially. The processing
of the loop nest may be optimized by converting the loop
nest code to parallel processes that can be executed
concurrently.

One way to optimize loop nest code is to transform the
code into a parallel form for execution on an array of
processor elements. The objective of this process is to assign
iterations in the loop nest to processor elements and
schedule a start time for each iteration. The process of assigning
iterations to processors and scheduling iterations is a
challenging task. Preferably, each iteration in the loop nest
should be assigned a processor and a start time so that each
processor is kept busy without being overloaded.

Another challenging task is reducing the cost of
controlling each processor element in a parallel array. In a naive
approach, the processor may have to compute, on the basis
of the current time, the vector of loop indices that describes
the iteration that it is about to compute, together with many
other quantities, such as memory addresses and tests of loop
bounds. Due to the complexity of these computations, it is
inefficient to re-compute them for each iteration.

SUMMARY OF THE INVENTION

The invention provides a method for exploiting temporal
recursion to reduce the cost of control code generated in
transforming a sequential nested loop program into a set of
parallel processes mapped to an array of processors. The
method is implemented in a parallel compiler process for
transforming a nested loop program into a set of single
loops, where each single loop is assigned to execute on a
processor element in a parallel processor array.

The method obtains a mapping of iterations of a nested
loop to processor elements in the array and a schedule of
start times for initiating execution of the iterations on
corresponding processor elements in the array. Based on this
mapping and iteration schedule, the method generates code
to compute iteration coordinates on a processor element for
an iteration of the single loop based on values of the iteration
coordinates for a previous iteration of the single loop.

In this context, the term "iteration coordinates" broadly
encompasses different types of coordinates used to reference
an iteration or set of iterations of the nested loop. In the
implementation, a parallel compiler maps a high level nested
loop in sequential form (e.g., C, Java, or Pascal code) into a set
of single loops mapped to physical processor
elements. The parallel compiler maps the iterations to virtual
processors, where each virtual processor is assigned a set of
iterations, and maps clusters of virtual processors to physical
processor elements. The iteration coordinates encompass
the coordinates of a virtual processor in a cluster as well
as quantities that are linearly related to these coordinates.
Examples of the coordinates include the global virtual
processor coordinates, and global iteration space coordinates
(e.g., the iteration vector expressed in terms of the loop indices
of the original loop nest). Linearly related quantities include
memory addresses of array elements read or written in the
loop body.

The parallel compiler generates code to compute loop
indices and quantities linearly related to these indices based
on previous values of these quantities on the same processor
element. For loop indices and linearly dependent quantities
(such as memory addresses), the parallel compiler selects an
arbitrarily small time lag so as to minimize the storage cost.
In this approach, the parallel compiler generates a decision
tree that implements the computation of iteration
coordinates from a value of the coordinates at a previous time.

The parallel compiler also generates code to test certain
loop boundary conditions. These tests include tests to
determine whether an iteration is at a cluster or tile edge. They
also include a test to determine whether an iteration is within
the bounds of the iteration space. The values of these tests
are boolean values that are temporally periodic. A buffer
may be used to propagate these periodic boolean values to
subsequent iterations, thereby avoiding the need to perform
the test over and over.

The approach outlined above significantly reduces the
cost of the control needed to compute loop indices, loop
tests, and memory addresses. The parallel compiler
generates control code that is efficient (e.g., a look up or add
operation) rather than more time consuming arithmetic
operations. This efficient form of code is advantageous for





US 6,374,403 B1


applications in which the loop nest code is compiled to an
existing processor array architecture and in which the loop
nest is transformed into optimized parallel code to be
synthesized into a new processor array.

The parallel compiler may generate code to implement the
loop tests with predicates, where operations in the loop body
are guarded by the predicates. In this case, the values of the
predicates are periodic boolean values propagated from a
prior iteration, and the loop body may be synthesized into
functional units that support predicated execution of the
operations in the loop body. This use of predicates makes the
mapping of the loop nest to a processor array more flexible
because it can be done without the concern that the mapping
will result in grossly inefficient control code. The test
whether an iteration is scheduled to execute at a given time
on a processor element is implemented efficiently with
predicated execution of the loop body.

Further advantages and features of the invention will
become apparent from the following detailed description
and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a conceptual overview of a design flow for
transforming a nested loop and specified processor
constraints into a set of parallel processes, each representing a
single loop scheduled for execution on a processor element
in a processor array.

FIG. 2 is a flow diagram illustrating the operation of a
system that implements the design flow of FIG. 1.

FIG. 3 is a flow diagram of a process for transforming a
nested loop into a two-dimensional loop over time and
systolic processors.

FIG. 4 is an example of a data flow graph for the loop
body of a finite impulse response filter, annotated with data
dependence information.

FIG. 5 is a diagram illustrating the process of tiling an
iteration space of a nested loop to partition the loop into sets
of iterations initiated sequentially.

FIG. 6 is a diagram illustrating a null vector superimposed
on an iteration space. In this example, the null vector is
parallel to one of the dimensions of the iteration space.

FIG. 7 is a diagram showing an example of scheduling
iterations that are mapped to virtual processors. The points
in the grid correspond to iterations in the iteration space and
the horizontal boxes represent virtual processors.

FIG. 8 is a diagram showing an example of scheduling
iterations where two virtual processors are assigned to each
physical processor element. The horizontal rows correspond
to virtual processors, and each of the boxes corresponds to
a cluster of virtual processors assigned to a physical
processor element.

FIG. 9 is an example of an activity table showing start
times for an iteration associated with virtual processors in a
cluster. Each box in the table corresponds to a virtual
processor in a cluster.

FIG. 10 shows the result of mapping iterations (the points)
from an iteration space to a time-virtual processor space.
The vertical lines represent a start time for initiating all
iterations that lie on the line.

FIG. 11 shows the result of mapping iterations from
virtual processor-time space of FIG. 10 to physical
processor-time space.

FIG. 12 shows a method for generating housekeeping
code to compute iteration coordinates, memory addresses,
and predicates efficiently using temporal recursion.

DETAILED DESCRIPTION

1.0 Introduction

The features summarized above are implemented in a
parallel compiler for transforming a sequential loop nest into
parallel code for execution on a parallel processor array.
Before describing aspects of the compiler in more detail, we
begin with a section of definitions of terms used throughout
the document. We then provide an overview of the parallel
compiler in Section 3. Finally, Section 4
describes components of the parallel compiler in more
detail.

2.0 Definitions

Nested Loop

A nested loop refers to a program in a high level language
such as C, Java, Pascal etc. that has an n-deep loop nest,
where n is an integer. For example, a three deep loop nest
may take the form:

    for i = i0, n1
      for j = j0, n2
        for k = k0, n3
          { loop body
          }

Single Loop

A single loop is a loop over a single dimension. For
example, a single loop may take the form:

    for t = tmin, tmax
      { loop body
      }

Time Loop

A single loop in which the iteration variable (t in the
above example) refers to time.

Iteration Space

An iteration space refers to the coordinate space of an
n-deep nested loop. This is a geometric view of an n-deep
nested loop. Each point in the iteration space represents the
computation corresponding to one iteration of the loop body.
Each dimension of the iteration space corresponds to a
level of the loop nest. The coordinates of a point in the
iteration space are given by the values of the iteration
variables of the loop nest, i.e., i, j and k in the above
example.

Processor Element

A processor element is a unit of processing resources that
is intended to be replicated into an interconnected array of
processor elements. Typically, it has a local storage (i.e., one
or more local memories, local registers and local FIFOs) and
includes one or more functional units for executing
operations.

A processor element may be programmable or
non-programmable. The primary difference between
programmable and non-programmable processors lies in the way that
their control logic is designed. There are the following two
broad approaches for designing the control logic.

1. Finite state machine (FSM) based control: In this
approach, there is no program stored in memory; the
processor contains all of the control logic in the form of a finite
state machine. The FSM can be implemented using
hard-wired logic, in which case the processor is
non-programmable and can execute only one program. It can also
be implemented using "reconfigurable" hardware such as
FPGAs or certain types of PLAs. In this case, the processor
can be re-configured to execute a different program.

2. Program counter based control: In this approach, the
control is expressed in the form of a program consisting of
a sequence of instructions stored in a program memory. The
processor contains a program counter (PC) that contains the
memory address of the next instruction to execute. In
addition, the processor contains control logic that repeatedly
performs the following sequence of actions:

A. Fetch instruction from the address in the PC.

B. Decode the instruction and distribute the control to the
control points in the processor datapath.

C. Update the PC as follows. If the instruction just
executed contains either an implicit or explicit branch and
the branch was taken, then the new value of the PC is the
branch target address specified in the instruction. In all other
cases, the next value of the PC is the address of the next
instruction in the program.

Virtual Processor

A virtual processor is one of the processors in a linear
mapping of an n-deep nested loop to an (n-1) dimensional
processor array. Each virtual processor may be thought of as
a single process. Possibly one, but typically two or more
virtual processors are assigned to a physical processor
element in a processor array.

Physical Processor Element

A physical processor element refers to a processor
element that is implemented in hardware.

Tile

A tile is a set of iterations generated by partitioning the
iteration space of a nested loop into sets of iterations, where
the sets are capable of being initiated and completed
sequentially. Iterations within the tile may be executed in a parallel
fashion.

Cluster

A cluster is a multi-dimensional, rectangular array of
virtual processors that map to a single physical processor in
a processor array. The clusters corresponding to physical
processors are disjoint, and their union includes all of the
virtual processors.

Expanded Virtual Register (EVR)

An infinite, linearly ordered set of virtual registers with a
special operation, remap( ), defined upon it. The elements of
an EVR, v, can be addressed, read, and written as v[n],
where n is any integer. (For convenience, v[0] may be
referred to as merely v.) The effect of remap(v) is that
whatever EVR element was accessible as v[n] prior to the
remap operation will be accessible as v[n+1] after the remap
operation.

EVRs are useful in an intermediate code representation
for a compiler because they provide a convenient way to
reference values from different iterations. In addition to
representing a place to store values, EVRs also express
dependences between operations, and in particular, can
express dependences between operations in different
iterations. While useful in scheduling operations, EVRs are not
necessary because dependence information may be specified
in other ways. For example, a compiler may perform loop
unrolling and provide dependence distances for scheduling
the operations.

Uniformized Array

An array of virtual registers such that there is a one-to-one
mapping between the elements of the uniformized array and
the iterations of a loop nest. Each element of the uniformized
array is assigned a value exactly once by a particular
operation of the corresponding iteration.

Dynamic Single Assignment (DSA)

A program representation in which the same virtual
register, EVR element or uniformized array element is never
assigned to more than once on any dynamic execution path.
The static code may have multiple operations with the same
virtual destination register as long as these operations are in
mutually exclusive basic blocks or separated by (possibly
implicit) remap operations. In this form, a program has no
anti- or output dependences as a result of register usage.

Uniformization

Uniformization is a process of converting a loop body into
a dynamic single assignment form. Each datum that is
computed in one iteration and used on a subsequent iteration
is assigned to an element of a uniformized array. This is done
with two objectives in mind. One is to eliminate anti- and
output dependences, thereby increasing the amount of
parallelism present in the loop nest. The second objective of
uniformization is to facilitate the reduction of the number of
accesses (e.g., loads and stores) between local storage of a
processor element and global memory. A variable converted
in this process is referred to as a "uniformized variable."

Dependence Graph

A data structure that represents data flow dependences
among operations in a program, such as the body of a loop.
In a dependence graph, each operation is represented by a
vertex in the graph and each dependence is represented by
a directed edge to an operation from the operation upon
which it is dependent. The distance of a dependence is the
number of iterations separating the two operations involved.
A dependence with a distance of 0 connects operations in the
same iteration, a dependence from an operation in one
iteration to an operation in the next one has a distance of 1,
and so on. Each dependence edge is also decorated with an
edge delay that specifies the minimum number of cycles
necessary, between the initiation of the predecessor
operation and the initiation of the successor operation, in order to
satisfy the dependence.

A loop contains a recurrence if an operation in one
iteration of the loop has a direct or indirect dependence upon
the same operation from a previous iteration. The existence
of a recurrence manifests itself as one or more elementary
circuits in the dependence graph. (An elementary circuit in
a graph is a path through the graph which starts and ends at
the same vertex and which does not visit any vertex on the
circuit more than once.) Necessarily, in the chain of
dependences between an operation in one iteration and the same
operation of a subsequent iteration, one or more
dependences must be between operations that are in different
iterations and have a distance greater than 0.

Predication

Predicated execution is an architectural feature that relates
to the control flow in a computer program. In a computer
architecture that supports predicates, each operation has an
additional input that is used to guard the execution of the
operation. If the predicate input is True, then the operation
executes normally; otherwise the operation is "nullified",
that is, has no effect on the processor state. Consider for
example the operation:

    r = a + b if p

The operation has p as the predicate input. If p is True,
the operation computes a+b and writes the result into register
r. On the other hand, if p is False, then the operation does not
store a new value into r.

Predicated execution simplifies the generation of control
hardware as it can be used to eliminate branches from
programs. Branches imply a synchronized, global control of
all of the functional units in that the functional units must be
re-directed to perform new operations. Predicated execution,
on the other hand, distributes control to each operation, and
thus, the control can be implemented efficiently in hardware.
Also, since predicate inputs are like other data inputs, code
transformations that are used for data, such as arithmetic
re-association to reduce height, can be used for predicates.

Throughput

Throughput is a measure of processor performance that
specifies the number of times a certain computation, such as
an iteration of the loop body, is performed per unit time.

Initiation Interval

A measure of processor performance, and the reciprocal
of throughput, that specifies the number of processor cycles
between the initiation of successive iterations of the loop
body.

Memory Bandwidth

Memory bandwidth is a measure of performance that
specifies the quantity of data per unit of time that can be
transferred to and from a memory device.

Minimum Initiation Interval (MII)

A lower bound on the initiation interval for the loop body
when modulo scheduled on the processor element. The MII
is equal to the larger of the RecMII and ResMII.

Resource-Constrained MII (ResMII)

A lower bound on the MII that is derived from the
resource usage requirements of the loop body (e.g., the
functional units required to execute the operations of the
loop body).

Recurrence-Constrained MII (RecMII)

A lower bound on the MII that is derived from latency
calculations around elementary circuits in the dependence
graph for the loop body.

Macrocell Library (also referred to as Macrocell
Database)

A macrocell library is a collection of hardware
components specified in a hardware description language. It
includes components such as gates, multiplexors (MUXes),
registers, etc. It also includes higher level components such
as ALUs, multipliers, register files, instruction sequencers,
etc. Finally, it includes associated information used for
synthesizing hardware components, such as a pointer to a
synthesizable VHDL/Verilog code corresponding to the
component, and information for extracting a machine
description (MDES) from the functional unit components.

In the current implementation, the components reside in a
macrocell database in the form of Architecture Intermediate
Representation (AIR) stubs. During the design process,
various synthesis program modules instantiate hardware
components from the AIR stubs in the database. The MDES
and the corresponding information in the functional unit
component (called mini-MDES) are in the form of a
database language called HMDES Version 2 that organizes
information into a set of interrelated tables called sections
containing rows of records called entries, each of which
contains zero or more columns of property values called
fields. For more information on this language, see John C.
Gyllenhaal, Wen-mei W. Hwu, and Bantwal Ramakrishna
Rau. HMDES version 2.0 specification. Technical Report
IMPACT-96-3, University of Illinois at Urbana-Champaign,
1996.

For more information on MDES and the process of
re-targeting a compiler based on the MDES of a target
processor, see U.S. patent application Ser. No. 09/378,601,
entitled PROGRAMMATIC SYNTHESIS OF A MACHINE
DESCRIPTION FOR RETARGETING A COMPILER, by
Shail Aditya Gupta, filed concurrently herewith, which is
hereby incorporated by reference. See also: Bantwal
Ramakrishna Rau, Vinod Kathail, and Shail Aditya.
Machine-description driven compilers for EPIC processors.
Technical Report HPL-98-40, Hewlett-Packard
Laboratories, September 1998; and Shail Aditya Gupta,
Vinod Kathail, and Bantwal Ramakrishna Rau. Elcor's
Machine Description System: Version 3.0. Technical Report
HPL-98-128, Hewlett-Packard Laboratories, October 1998,
which are hereby incorporated by reference.

3.0 Overview of the Parallel Compiler

FIG. 1 provides a conceptual overview of our parallel
compiler. The compiler transforms the program code of a
sequential loop nest from using a single address space (e.g.,
main memory) 100 to multiple address spaces (e.g., local
memory, look up tables, FIFOs, ROM, and main memory)
102. The initial sequential code contains references to a
global address space implemented in main memory of the
computer. The transformed code includes parallel processes
that access local storage of the processor array and that also
occasionally reference the global address space (e.g., in
main memory). The purpose of this stage in the design flow
is to minimize the use of local storage for each processor
element in the array while staying within the available main
memory bandwidth.

The compiler transforms the sequential computation of
the nested loop into a parallel computation 104. This stage
applies various parallel compiler methods to map the nested
loop code from its original iteration space to a
time-processor space.

Additionally, the compiler maps the parallel computation
to the specified number of processors. In the implementation
detailed below, this mapping yields a synchronized parallel
computation on an array of physical processor elements.

The compiler transforms the parallel computation
assigned to each processor into a single, non-nested loop to
yield a synchronized parallel program. At this stage, a single
time loop is assigned to each physical processor element.
The parallel compiler has transformed the loop nest program
into a form that enables a hardware synthesis process 106 to
convert this collection of single time loops to a hardware
structure representing the physical processor array. The
parallel compiler may also be designed to transform the loop
nest to an existing processor array architecture.

FIG. 1 depicts a specific example of this processor array.
In this example, the processor array comprises an array 108
of data path elements 110. Typically controlled by a
general-purpose computer, the array receives control signals via a
co-processor interface 112 and array controller 114. When it
receives a command to start executing a loop nest and is
initialized, the processor array executes the loop nest and
returns a signal indicating it is done.

3.1 The Parallel Compiler

FIG. 2 is a flow diagram providing an overview of the
parallel compiler. The parallel compiler performs data flow
analysis to construct a data flow graph and extract variable
references (204). Next, it maps iterations of the nested loop
to processor elements based on a desired processor topology
(206). Using this mapping, the processor constraints and
data dependence constraints, the parallel compiler performs
iteration scheduling to determine a start time for each
iteration that avoids resource conflicts (208). Finally, the
parallel compiler transforms the code from its initial
iteration space to a time-processor space 210, 212. The output
specifies a single time loop for each processor element. To
exploit parallelism of the array, the single time loops are
scheduled to execute concurrently in the processor elements
of the array.
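
The requirement that a schedule assign start times without resource conflicts can be made concrete with a small check. The sketch below rests on assumptions that are not taken from the text: a two-deep loop nest with iterations (i, j) for 0 <= i < n1 and 0 <= j < n2, a hypothetical mapping of iteration (i, j) to processor p = j, and a hypothetical linear schedule t = ti*i + tj*j with non-negative coefficients. The function name schedule_is_conflict_free is likewise illustrative. It reports whether any two iterations would land on the same processor at the same time.

```c
#include <stdlib.h>

/* Returns 1 when no two iterations (i, j) share a (processor, time) slot
 * under the assumed mapping p = j and assumed schedule t = ti*i + tj*j.
 * Assumes n1 >= 1, n2 >= 1, ti >= 0 and tj >= 0. Returns 0 on a conflict
 * or if the occupancy table cannot be allocated. */
int schedule_is_conflict_free(int n1, int n2, int ti, int tj)
{
    int times = ti * (n1 - 1) + tj * (n2 - 1) + 1;  /* time steps needed */
    unsigned char *busy = calloc((size_t)n2 * (size_t)times, 1);
    int ok = 1;

    if (busy == NULL)
        return 0;

    for (int i = 0; i < n1 && ok; i++) {
        for (int j = 0; j < n2; j++) {
            int p = j;                    /* assumed processor mapping */
            int t = ti * i + tj * j;      /* assumed start time */
            size_t slot = (size_t)p * (size_t)times + (size_t)t;
            if (busy[slot]) {             /* slot already taken: conflict */
                ok = 0;
                break;
            }
            busy[slot] = 1;
        }
    }
    free(busy);
    return ok;
}
```

With this mapping, a schedule whose coefficient ti is zero places every iteration with the same j on one processor at one time, so the check fails; any ti of at least 1 separates them.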
3.2 Summary of Implementation

The process illustrated in FIGS. 1 and 2 is implemented in
a collection of program modules that together form a parallel
compiler for generating code to run on a synchronous
processor array. The compiler system takes a loop nest in a
high level programming language such as C and generates a
set of synchronous parallel processes for synthesis into a
synchronous processor array. The parallel processes are
designed to satisfy specified performance and processor
constraints. Examples of the performance constraints
include the execution time (e.g., schedule length), and the
throughput of the processor array. The processor constraints
include the number of processors in the array, the topology
in which they are connected together, and the available
memory bandwidth.

The code generated in the compiler serves as a
specification of a synchronous processor array that implements the
nested loop. This array consists of nearly identical processor
elements. These elements are "nearly identical" in that the
boundary processors may be different from interior
processors. In some cases, "dead" hardware may be removed from
some processor elements (e.g., hardware synthesized from
code dependent on a predicate that is always false).

The processor elements are connected in a one or two
dimensional grid. The size and dimensionality of this array
is referred to as its topology. In general, each processor can
be connected to a set of neighboring processors. Preferably,
the design system should ensure that the time taken to go
from one processor to its neighbor is bounded by an upper
bound, which may be one of the parameters that is specified
at design time. The processors that can be reached in the
specified time are referred to as neighbors.

Each processor element typically contains a certain
amount of local storage, which may be organized as
registers, register files, or local memories. Similarly, the
processor array may have storage local to the array, which is
organized as local memory (e.g., RAM). These local
memories can be used to reduce the bandwidth between the
processor array and global memory.

Each of the processor elements executes nearly the same
program. Again, some of the boundary processors may
execute code that is slightly different from the code executed
by the interior processors.

The processor array is referred to as being synchronous in
that each of the processor elements executes its respective
parallel process in lock step. On each cycle, each
processor element executes the same instruction, and if one needs
to stall, then all do. As a result, any needed synchronization
can be guaranteed at the time that the array is designed.

The compiler system is implemented in collection of 
program modules written in the programming language. 50 
While the system may be ported to a variety of computer 
architectures, the current implementation executes on a 
PA-RISC workstation or server mnning under the HP-UX 
10.20 operating system. The system and its components and 
functions are sometimes referred to as being "program- 55 
matic." The term "programmatic" refers to a process that is 
performed by a program implemented in software executed 
on a computer, in hardwired circuits, or a combination of 
software and hardware. In the current implementation, the 
programs as well as the input and output data structures are
implemented in software stored on the workstation's 
memory system. The programs and data structures may be 
implemented using standard programming languages, and 
ported to a variety of computer systems having differing 
processor and memory architectures. In general, these
memory architectures are referred to as computer readable 
media. 
4.0 Parallel Compiler Components and Methodology

FIG. 3 is a flow diagram illustrating an implementation of
the parallel compiler. Each of the components in the parallel 
compiler is described in sub-sections 4.1 to 4.3. 

To illustrate the process of transforming a loop nest into 
a single time loop per processor element, it is helpful to 
consider a running example. The following sections explain 
this process using the following loop nest as an example. 



int i, j, n1, n2;

/* The actual data arrays in memory */
float x[n1], w[n2], y[n1-n2+1];
/* Loop Nest */
for (i = 0; i <= n1-n2; i++) {
    for (j = 0; j < n2; j++) {
        y[i] = y[i] + w[j] * x[i+j];
    }
}



The above code example is a Finite Impulse Response (FIR)
filter. The nested loop has the following characteristics:
there is no code except the statements in the innermost
loop body, and there is no procedure call in the inner loop.
Exact value flow dependence analysis is feasible for this
loop as there are no pointer de-references or non-affine loop
indices.

4.1 Data Flow Analysis 

The data flow analysis phase (300) takes a high level 
program (e.g., written in C), including one or more nested 
loops (302), identifies the nested loop or loops to be trans- 
formed into a synchronous processor array, and performs a 
data dependence analysis on each loop nest. The term
"synchronous" in this context means that the processor
elements share a common clock. Since the analysis is similar
for each loop nest, the remaining discussion assumes a 
single nested loop. 

4.1.1 Creating the Dependence Graph 

The data dependence analysis creates a data flow graph
(DFG) (304) for the loop nest. This graph is an internal data
structure representing the data dependences among operations
in the loop body, and dependences between operations
in different iterations of the loop body. 

FIG. 4 shows an example of the DFG for the example 
above. The graph records dependences between operations 
in the loop body. It also records dependences between 
iterations: read operations consuming input from another 
iteration, and write operations producing output to another 
iteration. 

The edges of the DFG are annotated with dependence 
distances and latencies. A dependence distance is an integer 
vector of size n giving the difference between the iteration 
indices of an operation that produces a value and the 
iteration indices of the iteration that uses the value. For data 
values that are used at many iterations, we introduce a DFG 
self-loop annotated by the distance between nearby itera- 
tions that use the same value (e.g., 406, FIG. 4). 

Edges are also annotated with operation latencies as 
explained in the next section. 

4.1.2 Extracting Latency Information 

An operation latency refers to the time in cycles between 
sampling an input of an operation and producing a result 
available for a subsequent operation. In the implementation, 
the data flow analysis phase extracts latency information 
from a database containing a machine description (MDES)
that stores latencies for various operations in a given archi- 
tecture. 

The DFG shown in FIG. 4 is annotated with latency 
information. In this example, the latency of a multiply
operation is three cycles (410), and the latency of an add
operation is two cycles (412). The latency of these opera- 
tions is dependent on the particular functional units that will 
implement them in the processor data path. The data flow 
analysis phase identifies the operations in the DFG and looks 
up the corresponding latency of these operations in the 
MDES section of a macrocell database. As shown in FIG. 4, 
the data flow analysis phase also assigns a latency to edges 
in the DFG that represent the latency of communicating an 
array element from one processor element to another 
(414-420). This information is added to the data flow graph
after mapping iterations to processor elements. 

In addition to the data flow graph, another output is a file 
of array references (FIG. 3, 306) in the loop body. This file 
contains one record per array indicating how the array is 
accessed. As explained below, this information is used to 
estimate the number of array elements read or written by a 
set of iterations. 

Several data flow programs may be used to implement 
this stage. The current implementation uses a program called
Omega from the University of Maryland. 

4.2 Preparation of Transformation Parameters 

Before the parallel compiler transforms the code, it deter- 
mines a mapping of iterations to processor elements and an 
iteration schedule. In general, the iterations in an n-deep 
loop nest are identified by the corresponding integer 
n-vector of loop indices in the iteration space. The mapping 
of iterations to processor elements identifies a corresponding 
processor element in an (n-1) dimensional grid of processor
elements for each of the iterations of the nested loop. 

As a practical matter, the iteration mapping will not match 
the desired topology of a physical processor array. As such, 
the implementation views the mapping of iterations to an 
(n-1) dimensional array of processor elements as a mapping 
to virtual processors. It then assigns virtual processors to 
physical processor elements. 

The parallel compiler determines an iteration mapping 
and iteration schedule based on processor constraints (308) 
provided as input. These constraints include the desired 
performance (e.g., an initiation interval), a memory band- 
width constraint, and a physical processor topology speci- 
fied as a one or two dimensional processor array. 

The implementation uses tiling to reduce local storage 
requirements of each physical processor element. It also 
employs clustering to map virtual processors to physical 
processors. While tiling is not required, it is useful for many 
practical applications. Tiling constrains the size of the itera- 
tion space and thus reduces local storage requirements. 
However, when the tile size is reduced, the memory band- 
width between local and global memory tends to increase. 
As the tile size shrinks, there are fewer iterations per tile and 
the ratio of iterations at the boundary of the tile relative to 
other iterations in the tile increases. Each local processor, 
therefore, benefits less from the re-use of intermediate 
results in the tile. 

4.2.1 Determining the Tile Size 

The process of tiling partitions the iterations of the nested 
loop into tiles of iterations that are capable of being initiated 
and completed sequentially. FIG. 5 shows a conceptual 
example of tiling a two-dimensional iteration space with 
iteration coordinates (i,j) into three tiles. In this example, the
tiler has partitioned the iteration space (500) into three tiles 
(502, 504, 506) along the j dimension. 

From the example, one can see that the ratio of iterations 
at the tile boundary to the other iterations in the tile increases 
when the tile size decreases. Thus, tiling is likely to increase 
memory bandwidth requirements because of the transfers 
between main memory and local storage that typically occur
at tile boundaries. 

The parallel compiler programmatically determines the
tile size (310) using the processor constraints (308) and the
array references extracted in the data flow analysis. It treats
the process of determining a tile size as a constrained
minimization problem. It picks a tile shape (i.e., the dimensions
of the tile) so that the number of iterations in the tile
is minimized while satisfying the constraint of a specified
main memory bandwidth. The specified memory bandwidth
is provided as input to the parallel compiler as one of the
processor constraints. To determine whether a selected tile
shape satisfies the bandwidth constraint, the parallel compiler
estimates the bandwidth using the following expression:

Estimated Bandwidth = (Estimated Traffic / Estimated Cycle Count)

The parallel compiler estimates the traffic (i.e., the estimated
transfers to and from main memory) based on the
dimensions of the tile and the array references of the
iterations in the tile. It estimates cycle count based on the
number of iterations in the tile, a specified initiation interval,
and the number of processors.

The estimated cycle count is expressed as:

Estimated Cycle Count = (Iterations per tile * initiation interval) / (number of processors)

(The units of the initiation interval are processor cycles
per iteration. The units of the estimated cycle count are
cycles.)

A variety of cost minimization algorithms may be used to 
solve this constrained minimization problem, such as simu- 
lated annealing or hill climbing. 
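As a rough illustration of this constrained minimization, the following C sketch scans candidate tile widths along j for the FIR example and returns the smallest width whose estimated bandwidth stays within a limit. The traffic model in `estimated_traffic` and all function names are assumptions made for this example; the text above does not say how traffic is derived from the array references. A plain linear scan stands in for simulated annealing or hill climbing.

```c
#include <assert.h>

/* Hypothetical traffic model for one FIR tile of ni x tj iterations:
 * tj words of w, ni+tj-1 words of x, and ni reads plus ni writes of y. */
double estimated_traffic(int ni, int tj)
{
    return (double)tj + (double)(ni + tj - 1) + 2.0 * (double)ni;
}

/* Estimated Cycle Count =
 *   (iterations per tile * initiation interval) / number of processors */
double estimated_cycles(int ni, int tj, int ii, int processors)
{
    assert(processors > 0);
    return ((double)ni * (double)tj * (double)ii) / (double)processors;
}

/* Estimated Bandwidth = Estimated Traffic / Estimated Cycle Count */
double estimated_bandwidth(int ni, int tj, int ii, int processors)
{
    return estimated_traffic(ni, tj) / estimated_cycles(ni, tj, ii, processors);
}

/* Smallest tile width in [1, max_tj] that meets the bandwidth limit,
 * or -1 when no candidate does. */
int pick_tile_width(int ni, int max_tj, int ii, int processors,
                    double max_bandwidth)
{
    for (int tj = 1; tj <= max_tj; tj++) {
        if (estimated_bandwidth(ni, tj, ii, processors) <= max_bandwidth)
            return tj;
    }
    return -1;
}
```

Under this model the estimated bandwidth falls as the tile widens, which matches the observation that smaller tiles raise the memory bandwidth requirement.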

In some cases, the tile size chosen by the tile size routine
may cause the iteration scheduling program module
to generate a set of schedules that does not satisfy the
dependence constraints of the nested loop. In this case, the
parallel compiler expands the tile size and generates a new
set of schedules. The selection of the tile size is complete
when the set of schedules includes one or more schedules
that satisfy the dependence constraints. The form of these
constraints is described further below.

4.2.2 Determining an Iteration to Processor Mapping 

Next, the parallel compiler determines a mapping of
iterations to processor elements (312). As introduced above,
the mapping assigns iterations of the nested loop to virtual
processors. This mapping produces an (n-1) dimensional
grid of virtual processors, each having a set of iterations. The
front-end assumes that the mapping is a linear projection of
the iterations in the iteration space in a chosen direction
called the "null direction." Each virtual processor is
assigned a set of iterations that lie along a straight line in the
null direction.

Conceptually, the mapping satisfies the following expression:

$$\Pi \vec{u} = \vec{0},$$

where $\Pi$ is an (n-1) by n matrix and $\vec{u}$ is the null vector.

For a given vector $\vec{u}$, there are infinitely many mappings
that satisfy this expression. However, as a practical matter,
there are certain choices that are more straightforward to
implement. In the implementation, the null direction is
selected to be parallel to one of the iteration coordinates.
FIG. 6 illustrates an example where the null vector (510),
shown superimposed on the iteration space, is parallel to
the i coordinate dimension (512) of a two-dimensional (i,j)
iteration space.

Given the null direction parallel to one of the iteration
coordinates, the implementation chooses a mapping such
that the virtual processors correspond to the remaining (n-1)
dimensions in the iteration space. Thus, $\Pi$ is chosen to
consist of n-1 rows of the identity matrix.

As explained later, it is not necessary to choose a null
direction parallel to one of the iteration space coordinates.
The iteration scheduling technique described below, for
example, applies to other linear mappings of the iteration
space in a null direction that is not parallel to one of the
iteration space coordinates.

While the implementation obtains a mapping of iterations
based on a user-specified null direction, it is also possible to
select a null direction and mapping programmatically. The
user may provide a null vector and allow the system to select
a mapping matrix. Conversely, the user may provide a
mapping matrix and allow the system to derive the null
direction (the mapping matrix determines the null vector
uniquely). Finally, the system may select a null direction and
a mapping matrix programmatically.

As noted above, the implementation assigns clusters of
virtual processors to physical processor elements. In order to
schedule iterations on physical processors, the parallel compiler
uses the cluster shape as an input. The user specifies the
topology of the physical processor array. The parallel compiler
defines the cluster shape as:

$$\vec{C} = \lceil \vec{V} / \vec{P} \rceil$$

where $\vec{C}$ is the cluster shape, $\vec{V}$ is the shape of the virtual
processor space, and $\vec{P}$ is the shape of the physical
processor array. There are a number of possible choices
for associating axes of the virtual processor space with
axes of the physical processor space. Section 4.2.3
describes one way to optimize the selection.

4.2.2 Iteration Scheduling

After determining an iteration to processor mapping, the
parallel compiler performs iteration scheduling (314) to
compute a definition of schedules compatible with the
specified processor constraints (308).

FIG. 7 illustrates the concept of iteration scheduling of
virtual processors. In this phase, an iteration scheduler finds
a start time for each iteration to satisfy certain constraints,
e.g., the data dependences in the nested loop. In this
example, the virtual processors are represented as horizontal
boxes (520-530), each containing a set of iterations. The
schedule is a vector $\vec{\tau}$ that defines a start time for each
iteration.

FIG. 8 alters the example of FIG. 7 by extending it to
show iteration scheduling of clustered virtual processors.
The horizontal rows correspond to virtual processors, and
each of the boxes (532, 534, 536) corresponds to a cluster of
virtual processors assigned to a physical processor element.
Each start time in the iteration schedule is depicted as a line
through the iteration space. The points intersecting each time
line are the iterations that begin execution at that time. Due
to resource sharing conflicts, it is not possible to schedule
two iterations to start at the same time on the same processor.
More specifically, the iteration scheduler ensures that only
one iteration starts on a processor per initiation interval.

The iteration scheduler in the parallel compiler generates
a definition of tight schedules for a given mapping of
iterations to processor elements. A tight schedule is one that
provides a start time for each iteration on a processor
element such that no more than one iteration is started on a
processor element for each initiation interval, and after some
initial period, exactly one is scheduled for each initiation
interval. If the initiation interval is one cycle, for example,
then one and only one iteration starts executing on each
processor element per cycle.

Given a cluster shape and a mapping of iterations to
processor elements in a null direction parallel to one of the
iteration coordinates, the iteration scheduler produces a
definition of tight schedules:

$$\vec{\tau} = (k_1,\ k_2 C_1,\ k_3 C_1 C_2,\ \ldots,\ k_n C_1 \cdots C_{n-1}) \qquad (1)$$

where $\vec{C} = (C_1, \ldots, C_{n-1})$ is a vector giving the shape of the
cluster of virtual processors assigned to each physical
processor and $\vec{k} = (k_1, \ldots, k_n)$ is a vector of integers
that has the property that the greatest common divisor
of $k_i$ and $C_i$ is one and $k_n = \pm 1$. A preliminary
permutation of the axes of the cluster may be applied
(e.g., when n=3, $\vec{\tau} = (k_1, k_2 C_1, k_3 C_1 C_2)$ or
$\vec{\tau} = (k_1 C_2, k_2, k_3 C_1 C_2)$). As such, schedules where such a preliminary
permutation is applied should be deemed equivalent.

The start time of each iteration is the dot product of $\vec{\tau}$ and
the iteration coordinates of the iteration:

$$t(\vec{j}) = \vec{\tau} \cdot \vec{j} = k_1 j_1 + k_2 C_1 j_2 + \cdots + k_n C_1 \cdots C_{n-1}\, j_n$$

where $\vec{j}$ is a vector of loop indices.

The null direction need not be in a direction parallel to one
of the iteration space coordinates. The above definition may
be used to determine a tight schedule for any linear
transformation of the iteration space.

To show this concept, let S be the inverse of a unimodular
extension of $\Pi$. The last column of S is the null vector $\vec{u}$.
The remaining columns are the vectors that describe the
virtual processor array. In particular, the first (n-1) rows of
$S^{-1}$ are the projection matrix $\Pi$. The transformation matrix
M is the matrix whose first row is $\vec{\tau}$ and whose last (n-1)
rows are $\Pi$:

$$M = \begin{pmatrix} \vec{\tau} \\ \Pi \end{pmatrix}; \quad \text{thus} \quad \begin{pmatrix} t \\ \vec{v} \end{pmatrix} = M \vec{j}$$

is the mapping from iteration $\vec{j}$ to time t and virtual
processor $\vec{v}$. We now change basis in the iteration space:
$\vec{j}' = S^{-1} \vec{j}$ are the coordinates of the iteration with respect to
the basis consisting of the columns of S. In this basis, the
transformation becomes:

$$\begin{pmatrix} t \\ \vec{v} \end{pmatrix} = M S \vec{j}' = \begin{pmatrix} \vec{\tau} \cdot S \\ \Pi S \end{pmatrix} \vec{j}' = \begin{pmatrix} \vec{\tau} \cdot S \\ I_{n-1} \;\; 0 \end{pmatrix} \vec{j}'$$

Clearly, $\vec{\tau}$ is a tight schedule with cluster shape $\vec{C}$ and
mapping $\Pi$ if and only if $\vec{\tau} \cdot S$ is a tight schedule for $\vec{C}$ with
the mapping $(I_{n-1}\ 0)$. Hence, the generalized condition (1)
applied to $\vec{\tau} \cdot S$ is a necessary and sufficient condition for a
tight schedule. The formula does not specify the components
of $\vec{\tau}$ but rather the components of $\vec{\tau} \cdot S$, and $\vec{\tau}$ is recovered
through the integer matrix $S^{-1}$.

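The cluster shape, formula (1), and the start-time dot product can be sketched in C as follows. This is a minimal illustration and not the implementation described here: the function names, the array-based vector representation, and the virtual and physical array shapes used in the usage note are assumptions.

```c
#include <assert.h>

/* Cluster shape C = ceil(V / P), taken axis by axis. */
void cluster_shape(const int *v, const int *p, int *c, int dims)
{
    for (int a = 0; a < dims; a++) {
        assert(p[a] > 0);
        c[a] = (v[a] + p[a] - 1) / p[a];
    }
}

/* Formula (1): tau = (k1, k2*C1, k3*C1*C2, ..., kn*C1*...*C(n-1)).
 * k and tau have n entries; c has n-1 entries. */
void tight_schedule(const int *k, const int *c, int *tau, int n)
{
    long scale = 1;
    for (int a = 0; a < n; a++) {
        tau[a] = (int)(k[a] * scale);
        if (a < n - 1)
            scale *= c[a];
    }
}

/* Start time of iteration j: the dot product of tau and j. */
long start_time(const int *tau, const int *j, int n)
{
    long t = 0;
    for (int a = 0; a < n; a++)
        t += (long)tau[a] * j[a];
    return t;
}
```

For instance, a hypothetical 16 by 20 virtual processor space on a 4 by 4 physical array gives the cluster shape (4, 5), and k = (7, 1, 1) then yields the schedule (7, 4, 20).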
In addition to the constraint that the schedule must be
tight, the schedule must also satisfy the dependence constraints.
These constraints are expressed as a system of linear
inequalities of the form:

$$A \vec{\tau} \geq \vec{b},$$

where A is a matrix of offsets and b is a vector of time
intervals.

An example of the linear inequality constraints for the
example of the FIR filter and its annotated DFG graph of
FIG. 4 is:
The entries of the rows in A come from the sums of the
distance vectors on the edges of elementary circuits in the
dependence graph. The entries of b are from the latencies of
macrocells and from the latencies of interprocessor
communication.

The top element of b represents the loop on w, the middle
element represents the loop on x, and the third element
represents the loop on y. The latency for the loop on w is 0
cycles because the value of this array element is available
within the processor element. The latency for the loop on x
is 1 because the latency of interprocessor communication
between adjacent physical processor elements is assumed to
be 1 cycle. Finally, the latency for the loop on y is the sum
of the latency for an add operation (2 cycles in the example
of FIG. 4) and the latency for interprocessor communication
(1 cycle).

The implementation of the iteration scheduler employs
linear programming to select values of k for use in formula
(1) so that the values of $\vec{\tau}$ are likely to satisfy these linear
inequality constraints. Starting with the system of linear
inequalities from the data flow analysis, the iteration scheduler
computes an upper and lower bound on $\vec{\tau}$. These
bounds represent the smallest vector box containing the
solution to $A \vec{\tau} \geq \vec{b}$. Based on the bounds on $\vec{\tau}$, the iteration
scheduler computes bounds on the values of k. With these
bounds, the iteration scheduler can then select several values
of k that fall within the bounds and then exclude all of the
selected values where the greatest common divisor
with the corresponding value of C is not equal to 1. This
approach provides a direct and efficient programmatic
method for selecting several iteration schedules that satisfy
the desired constraints.
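The filtering step at the end of this procedure can be sketched in C as below. The sketch assumes the bounds on a component of k have already been obtained (the linear programming step is not shown), and the names `gcd_int` and `coprime_candidates` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Greatest common divisor of |a| and |b| (Euclid's algorithm). */
int gcd_int(int a, int b)
{
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Collect the integers k in [lo, hi] with gcd(k, c) == 1, i.e. the
 * candidates that survive the exclusion step for cluster extent c.
 * Writes at most cap values into out and returns how many were kept. */
int coprime_candidates(int lo, int hi, int c, int *out, int cap)
{
    int count = 0;
    assert(c > 0);
    for (int k = lo; k <= hi && count < cap; k++) {
        if (gcd_int(k, c) == 1)
            out[count++] = k;
    }
    return count;
}
```

Each surviving value can then be substituted into formula (1) and the resulting schedule checked against the dependence inequalities.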

To illustrate iteration scheduling, it is instructive to look 
at an example. 

Let n=3; let $\vec{C} = (4, 5)$. Assume that $\vec{u} = (0, 0, 1)$ is the
smallest integer null vector of the space mapping. From (1),
either $\vec{\tau} = (k_1, 4k_2, \pm 20)$ or $\vec{\tau} = (5k_1, k_2, \pm 20)$, where the
greatest common divisor of $k_i$ and $C_i$ is 1 for i=1, 2. For
example, $\vec{\tau} = (7, 4, 20)$ is a tight schedule (with $k_1 = 7$, $k_2 = 1$,
$k_3 = 1$) that corresponds to the activity table shown in FIG. 9.
The example in FIG. 9 shows the activity table for a four by
five cluster of virtual processors (540) assigned to a physical
processor element (e.g., processor element 542). In this
example, the topology of the physical processor array is five
vertical processor elements by four horizontal elements. The
arrows joining the processors depict interprocessor
communication.



The number in each box of the activity table denotes the 
residue modulo 20 of the times at which the virtual processor 
in that position within the cluster is active. For a tight 
schedule, these are all different (the c-^ axis is the vertical
axis in the diagram). 
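The tightness property of the activity table can be checked mechanically. The C sketch below fills a table for a two-dimensional cluster with the residues described above and reports whether they are all different. It assumes an initiation interval of one cycle; the function names and the row-major table layout are choices made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Residue of a in [0, m), also for negative a. */
int residue(long a, int m)
{
    return (int)(((a % m) + m) % m);
}

/* Fill table[c1 * c2_size + c2] with (tau1*c1 + tau2*c2) mod gamma, where
 * gamma = c1_size * c2_size is the number of virtual processors in the
 * cluster. Returns true when every residue 0..gamma-1 appears exactly once. */
bool activity_table(int tau1, int tau2, int c1_size, int c2_size, int *table)
{
    int gamma = c1_size * c2_size;
    bool tight = true;
    bool *seen = calloc((size_t)gamma, sizeof *seen);
    assert(seen != NULL);
    for (int c1 = 0; c1 < c1_size; c1++) {
        for (int c2 = 0; c2 < c2_size; c2++) {
            int r = residue((long)tau1 * c1 + (long)tau2 * c2, gamma);
            table[c1 * c2_size + c2] = r;
            if (seen[r])
                tight = false;   /* two virtual processors share a slot */
            seen[r] = true;
        }
    }
    free(seen);
    return tight;
}
```

With the first two components of the example schedule (7, 4, 20) and the cluster shape (4, 5), all twenty residues are distinct; replacing 7 by 2, which shares a factor with 4, produces collisions.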

4.2.3 Optimizing Cluster Topology 

As described above, the cluster shape is derived from the 
mapping of virtual processor axes to physical processor 
axes. There are typically two or more choices for this 
mapping that will produce a satisfactory iteration mapping 
and scheduling. The parallel compiler code may simply pick 
one of these choices. It can achieve a better result, however, 
by trying each of the possible choices and picking one that
results in the best cost and performance (e.g., shortest total 
schedule). 

4.3 Code Transformation 

After determining the iteration mapping and scheduling, 
the front-end performs a series of loop transformations. 
These transformations are described in sections 4.3.1 to 
4.3.5. 

4.3.1 Tiling 

The tiling process (318) partitions the iterations in the 
nested loop based on the tile dimensions computed previ- 
ously. Tiling transforms the code to a sequential loop over 
tiles having the form: 



for (tile)
    for (point in tile) {
    }



Applied to the running example of the FIR filter, tiling
yields:

/* This loop runs on the host, and loops over tiles */
for (jb = 0; jb < n2; jb += tile_size_2) {
    /* loop nest, over one tile */
    for (i = 0; i <= n1-n2; i++)
        for (jp = 0; jp < tile_size_2; jp++) {
            j = jb + jp;
            y[i] = y[i] + w[j]*x[i+j];
        }
}
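A self-contained C version of the original and tiled loop nests is sketched below so the two can be compared directly. The function wrappers and parameter lists are assumptions, and the `j < n2` guard is an addition for tile sizes that do not divide n2 evenly; it does not appear in the listing above.

```c
#include <assert.h>

/* Reference FIR loop nest: y[i] += w[j] * x[i+j]. */
void fir(const float *x, const float *w, float *y, int n1, int n2)
{
    for (int i = 0; i <= n1 - n2; i++)
        for (int j = 0; j < n2; j++)
            y[i] = y[i] + w[j] * x[i + j];
}

/* Tiled form: the outer loop walks tiles along j, and the inner nest
 * covers the points of one tile. */
void fir_tiled(const float *x, const float *w, float *y,
               int n1, int n2, int tile_size_2)
{
    assert(tile_size_2 > 0);
    for (int jb = 0; jb < n2; jb += tile_size_2)
        for (int i = 0; i <= n1 - n2; i++)
            for (int jp = 0; jp < tile_size_2; jp++) {
                int j = jb + jp;
                if (j < n2)   /* added guard for a partial last tile */
                    y[i] = y[i] + w[j] * x[i + j];
            }
}
```

For a fixed i, both versions add the products in increasing order of j, so they accumulate each y[i] in the same order.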



4.3.2 Uniformization 

Uniformization is a method for transforming the code so 
as to eliminate anti- and output dependences, and to reduce 
transfers between local and global memory (e.g., load and 
store operations in a processor element). By eliminating
dependences, uniformization increases the amount of parallelism
present in the loop nest. It also reduces accesses
between local storage and global memory. Preferably, each 
processor array should propagate a result from a producer 
operation in one iteration to a consumer operation in another 
iteration without accesses to global memory, except as 
necessary at tile boundaries. As part of this code transfor- 
mation process, the parallel compiler converts arrays in the 
loop body into uniformized arrays. 

To uniformize an array, the parallel compiler uses the
dependence data to convert the array indices such that each
element of the array is defined at one unique iteration (320).
This process is referred to as dynamic single assignment. 
Uniformized arrays are realized by the back end as registers 
local to the physical processor elements, while ordinary 
arrays hold data in global memory (e.g., main memory of the 
host processor). 
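To make dynamic single assignment concrete, the following C sketch processes one tile of the running FIR example with three local arrays, here called WW, XX and YY, standing in for the uniformized arrays. Each element of these arrays is written exactly once, and global memory (x, w, y) is touched only at the tile boundaries. The function name, the fixed capacity limits, and the use of a temporary for the incoming y value are assumptions of this sketch.

```c
#include <assert.h>

enum { MAX_I = 16, MAX_JP = 8 };   /* illustrative capacity limits */

/* One tile (columns jb .. jb+tile_size_2-1) of the FIR nest in dynamic
 * single assignment form. */
void fir_tile_uniformized(const float *x, const float *w, float *y,
                          int n1, int n2, int jb, int tile_size_2)
{
    static float WW[MAX_I][MAX_JP], XX[MAX_I][MAX_JP], YY[MAX_I][MAX_JP];
    int rows = n1 - n2 + 1;
    assert(rows <= MAX_I && tile_size_2 <= MAX_JP);
    assert(jb + tile_size_2 <= n2);   /* tile must lie inside the j range */

    for (int i = 0; i < rows; i++)
        for (int jp = 0; jp < tile_size_2; jp++) {
            int j = jb + jp;
            /* y[i] is loaded once per tile, then passed along jp */
            float acc = (jp > 0) ? YY[i][jp - 1] : y[i];
            /* x[i+j] arrives from iteration (i-1, jp+1) when it exists */
            XX[i][jp] = (i > 0 && jp < tile_size_2 - 1)
                            ? XX[i - 1][jp + 1] : x[i + j];
            /* w[j] is passed along i */
            WW[i][jp] = (i > 0) ? WW[i - 1][jp] : w[j];
            YY[i][jp] = acc + XX[i][jp] * WW[i][jp];
            if (jp == tile_size_2 - 1)
                y[i] = YY[i][jp];     /* store the live-out value */
        }
}
```

Calling the function once per tile, with jb advancing by the tile size, reproduces the result of the original loop nest when the tile size divides n2.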
Based on the dependency data in the data flow graph, the
uniformized arrays for our running example are:

WW[i][j] which holds w[j]
XX[i][j] which holds x[i+j]
YY[i][j] which holds y[i]

From the dependence relation involving x we know that

XX[i][j] = XX[i-1][j+1]

We can use this to define XX[i][j] whenever (i-1, j+1) is
a valid iteration, i.e., when i>0 and jp<tile_size_2-1.
Otherwise, we need to load data from global memory as
shown below:

if (i == 0 || jp == tile_size_2-1)
    XX[i][j] = x[i+j];

The uniformized loop nest over one tile in full is

for (i = 0; i <= n1-n2; i++)
    for (jp = 0; jp < tile_size_2; jp++) {
        j = jb + jp;
        if (jp > 0) YY[i][jp] = YY[i][jp-1];
        else YY[i][jp] = y[i];
        if (i > 0 && jp < tile_size_2-1)
            XX[i][jp] = XX[i-1][jp+1];
        else XX[i][jp] = x[i+j];
        if (i > 0) WW[i][jp] = WW[i-1][jp];
        else WW[i][jp] = w[j];
        YY[i][jp] = YY[i][jp] + XX[i][jp]*WW[i][jp];
        if (jp == tile_size_2-1)
            y[i] = YY[i][jp];
    }

4.3.4 Loop Transformation

In the next phase of loop transforms, the parallel compiler
converts the current form of the loop nest to a time-physical
processor space based on the selected iteration schedule
(322).

The parallel compiler converts the code to physical processor
code using clustering. In this phase, it transforms the
code from iteration space to time-physical processor space.
The t loop bounds are derived from the schedule and the tile
shape; the processor loop bounds are given by the selected
processor topology.

References to a uniformized array such as XX[i-1][jp+1]
are transformed first into a reference of the form
XX[t-$\delta t$][$v_1$-$\delta v_1$], where $\delta t$ and $\delta v_1$ are given by

$$\begin{pmatrix} \delta t \\ \delta v_1 \end{pmatrix} = \begin{pmatrix} \vec{\tau} \\ \Pi \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$

These are the offsets in the time and virtual processor
coordinates between the start of iteration (i, jp) and the start
of iteration (i-1, jp+1). For example, when $\vec{\tau} = (2, -3)$ and
$\Pi = (0, 1)$, XX[i-1][jp+1] is changed to XX[t-5][v1+1].
Then, this reference is converted to the time, physical
processor coordinates using the cluster shape and the cluster
coordinate c1. The assignment statement

XX[t][v1] = XX[t-5][v1+1];

becomes

if (c1 == 1)
    XX[t][sp1] = XX[t-5][sp1+1];
else
    XX[t][sp1] = XX[t-5][sp1];

After this phase, the loop nest is in time-physical processor
(t-sp) form. The outer loop is a loop over time, and the inner loop is
a loop over physical processor elements. Each processor
element executes the time loop in sequential order, in
lock step and in parallel with the others.

4.3.5 Recurrence Optimizations

In a recurrence optimization phase (324), the parallel
compiler uses temporal recurrence to transform the loop
programmatically from t-sp form to an optimized t-sp form
(326).

After transformation to t-sp form, the loop body serves as
a specification of each processor element. In an implementation
that employs tiling, the loop nest has outer loops over
tiles and an inner nest consisting of a sequential loop over
time and a parallel nest over processors.

The transformed parallel loop body contains control code
that we refer to as "housekeeping code." The housekeeping
code includes code for computing various iteration
coordinates, code for computing memory addresses, and
code for testing loop bounds. Due to its computational
complexity and potential memory usage, this code can be
quite costly in terms of computation time, memory usage,
and memory bandwidth. It is particularly important to
reduce these costs to make the synthesized processor array
more efficient.

The parallel compiler uses temporal recurrence to implement
the housekeeping code. The parallel compiler generates
code to compute various iteration coordinates, memory
addresses, and loop bound tests locally in a processor
element using values computed from a previous iteration.
This method of exploiting temporal recurrence is implemented
in two forms: 1) code is generated to update certain
values from a previous iteration at a selectable time lag
behind the current iteration; and 2) code is generated to
maintain quantities that repeat periodically in a local storage
buffer for later reuse.

The first form is implemented as a decision tree. The local
processor element follows a path through the decision tree to
one of its leaves. The value of the quantity for the current
iteration is obtained by adding a constant to the value of that
quantity from a previous iteration. The leaf provides the
value of the added constant.

The second form is implemented in a local storage buffer
having a depth corresponding to the repeat period of the
desired quantity on the processor element. As explained
further below, this repeat period is the number of virtual
processors assigned to a physical processor element.

These approaches take advantage of the tight iteration
schedule. In such a schedule, the physical processor visits
each virtual processor assigned to it in a round robin fashion.

In the implementation, the housekeeping code has several
forms and functions:

Cluster Coordinates: The cluster coordinates, sometimes
referred to as "local virtual processor coordinates," give
the position of the currently active virtual processor. For
a given time on a processor element, the processor element
may need to compute the currently active virtual
processor.

Global Virtual Processor Coordinates: The global virtual
processor coordinates give the position of a virtual processor
in the time-virtual processor space.

Iteration Space Coordinates: The iteration space coordinates
are the coordinates of an iteration in the original coordinate
space of the nested loop. These coordinates sometimes
appear in the transformed code.

Memory Addresses: The memory address is the location of
an array element whose indices are affine functions of the
iteration space coordinates. When data is live-in or
live-out to the loop nest, it is read in or stored into global
memory. In these circumstances, the local processor needs
to compute the global memory address for the element.

Boundary Predicates: Sometimes referred to as guards, these
quantities represent tests of loop boundary conditions.
These loop tests include cluster edge predicates, tile edge
predicates, and iteration space predicates. Cluster edge
predicates indicate whether an iteration is at the boundary
of a cluster. Tile edge predicates indicate whether an
iteration is at the boundary of a tile. Finally, iteration
space predicates test the coordinates against the limits of
the iteration space.

With a tight schedule, the cluster and virtual processor
coordinates, and all but one of the global iteration coordinates
are periodic with a period equal to the number of
virtual processors per cluster. The one global iteration coordinate
that is not periodic is the coordinate chosen to be
parallel to the null direction. The cluster coordinates are
periodic functions. The other coordinates and memory
addresses are linear functions of time and the cluster coordinates.
Most of the boolean predicate values are defined by
linear inequalities in the cluster coordinates, and as such, are
also periodic.

One approach for exploiting temporal recurrence is to
implement a circular buffer of depth equal to the number of
virtual processors in a cluster. This buffer is used to store
recurring values. With each iteration, the values in the
circular buffer are advanced one position so that the recurring
value corresponding to the current iteration is generated.
This approach has the drawback of requiring a large
buffer when the cluster size is large.

An alternative approach is to update the cluster coordinates
$\vec{c}(t, \vec{p})$ from their values at an arbitrary previous
cycle but on the same processor: $\vec{c}(t, \vec{p}) = R(\vec{c}(t - \delta t, \vec{p}))$
(here R stands for the recurrence map that we now explain).
In this approach, the parallel compiler may select any time
lag $\delta t$, as long as $\delta t$ is not so small that the recurrence
becomes a tight dataflow cycle inconsistent with the selected
iteration schedule. The form of R is straightforward. Using
a binary decision tree of depth (n-1), we find at the leaves
of the tree the increments $\vec{c}(t, \vec{p}) - \vec{c}(t - \delta t, \vec{p})$. The tests at
the nodes are comparisons of scalar elements of $\vec{c}(t - \delta t, \vec{p})$
with constants that depend only on $\vec{C}$ and the schedule $\vec{\tau}$.
They are thus known at compile time and can be hard coded

addition is needed to compute each such value. This
approach reduces the problem of cost reduction to that of the
update of the cluster coordinates.

Next, we describe how to generate this decision tree
programmatically. It is possible to construct the decision tree
without resorting to construction and exploration of the
activity table. Since $\vec{\tau}$ is tight, we know that (up to a
permutation of the indices, and with the proviso that $\Pi$
consists of the first (n-1) rows of the identity):

$$\vec{\tau} = (k_1,\ k_2 C_1,\ \ldots,\ k_{n-1} C_1 \cdots C_{n-2},\ \pm\gamma) \qquad (2)$$

where the greatest common divisor of $k_i$ and $C_i$ is 1 for all
$1 \le i \le n-1$ and $\gamma = C_1 \cdots C_{n-1}$.

We consider the set C of cluster coordinates within a
cluster of virtual processors, i.e., within one physical processor:

$$C = \{\, \vec{c} \in Z^{n-1} \mid 0 \le c_k < C_k,\ k = 1, \ldots, (n-1) \,\}$$

For every position $\vec{c} \in C$, we associate a local clock at
which the given virtual processor is active:

$$t_c(\vec{c}) = \tau_1 c_1 + \cdots + \tau_{n-1} c_{n-1} \pmod{\gamma}$$

The function $t_c$ maps C one-to-one onto [0 . . . ($\gamma$-1)]. Let $\delta t$ be a given
time lag. We wish to know the set B of all of the differences
of positions (members of C - C, the set of differences of
cluster coordinate vectors) that can occur as values of
$t_c^{-1}(t + \delta t) - t_c^{-1}(t)$. For convenience, we use $\vec{\tau}$ for its first
(n-1) components wherever convenient. By definition, B
consists of the position-difference vectors $\vec{x}$ that satisfy
$\vec{\tau} \cdot \vec{x} \equiv \delta t \pmod{\gamma}$. By (2), we have

$$k_1 x_1 + k_2 C_1 x_2 + \cdots + k_{n-1} C_1 \cdots C_{n-2}\, x_{n-1} \equiv \delta t \pmod{\gamma} \qquad (3)$$

Let $\delta t = q_1 C_1 + r_1$ where $0 \le r_1 < C_1$. By
(3), we have that $k_1 x_1 \equiv r_1 \bmod C_1$, so that $x_1 \equiv k_1^{-1} r_1 \bmod C_1$.
Note that $k_1$ and $C_1$ are relatively prime, so that $k_1$ has a
multiplicative inverse among the integers modulo $C_1$.
Because the entries of $\vec{x} \in B$ are differences of coordinates of
elements of C, it follows that there are only two possible
values for $x_1$:

$$x_1 \in \{\, k_1^{-1} r_1 \bmod C_1,\ (k_1^{-1} r_1 \bmod C_1) - C_1 \,\}$$

These are the two possible differences of the first coordinate
of $t_c^{-1}(t + \delta t)$ and $t_c^{-1}(t)$. The choice is made on the simple basis
of which leads to a new point in the activity table. Only one can:

$$c_1(t + \delta t) = c_1(t) + \begin{cases} k_1^{-1} r_1 \bmod C_1 & \text{if } c_1(t) + (k_1^{-1} r_1 \bmod C_1) < C_1 \\ (k_1^{-1} r_1 \bmod C_1) - C_1 & \text{otherwise} \end{cases}$$

into the processor hardware. Pursuing this line of argument, for each choice of change in 

Many quantities are linearly dependent on time and the cx)ordinate, Xj, we determine the two possible choices 

cluster coordinates, and thus, may be computed in a similar the change in the second coordinate, x^. From (3) we 
fashion. The global virUial processor coordinates v , the 60 have that k2CiX2+ . . . +k„_iCjX . . . xC,^2X„_i"(6t-kiXi) 

, , , . (modv). We already know that 8t-k,Xi is a multiple of C,. 

global Iteration space coordinates j , and the memory ^^^^ 

addresses are all linear functions of them. Once the parallel 

compiler selects the change in 6t and the change in ~c then ^{(f^-krX^JC^ (modCj) 

it can programmatically generate code to compute the 65 ^ bcfoTC, we conclude that 

changes in all of these derived values, and these changes 

appear as explicit constants in the generated code. Only one *2c{*2"*((6'-*iJfi)/<^i) mod C2,ik2'\{bt-k^t)fCJiasidC^C2) 
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Continuing m this way. we arrive at the tree of changes of ^ ^ ^^_^y «,„ponents of M(5tx w) 

cluster coordiaates. ^ ^ ^ ^ 

To illustrate this approach, consider the example shown in are in the activity table. This vector 0et us denote it by 6 [0, 

FIG. 9., ... yOJ) wiU be one of the candidates in the decision tree. 
Take 6t= 1 5 j^^^^ ^^^^^ second component of M(T+^[0, • - • ,0]) is 

C-(4,5) and still stricdy less than C^, then we are in the first case (first 

7=(7 4 20) =(k t) branch of Uie tree), or this component is strictly less than 2C^ 

where k^^? and k^-l and we simply subtract t ^ to go back in the activity table: 

thus, r,«l, ki"^ri mod Ci«7"^mod 4=3, and m "Tm m ~^rn / i ui i- 

' ^ ' ^ ^ ^ * 10 g [0, . . . ,0]- 6 [0, . . . ,0]- t 2 (phis possibly a hnear 

( 3 if c. (0 + 3 i 3 combination of t 3, . . . , t „ so that the (n-2) last compo- 

Ci(i + l) = C|(r) + ^ . , . 

i -1 othowise nents of M 8 [1,0, ... ,0] arc in the activity table), is one of 

the candidates in the decision tree. Continuing in this way, 
Now consider the decision tree branch for C,(t)=0, in we end up with at most 2("-^> vectors (at most two cases for 
which x,.C,(t^.l)-C,(t)-3, Then ^^^'^^ dimension, and only one when the corresponding 

component of the move vector is zero). 

Ml = Jf2 ■ i{St-kixi)/Ci)a)DdC2 The notation in brackets for the vectors 8 specifies if the 

((1 7 3)/4) d5 M move is nonnegative (0) or negative (1): for example, 

8 [0,1,1] corresponds to the case where we move forward in 
a -5mod5 jjjg ^jgj dimension, and backward in the two other dimen- 

= 0 sions. 

We illustrate this technique on the example shown in FIG. 
25 9 and discussed in Section 4.2.2. 
which is precisely the correct change to Cj for the times at The Hermite Form of the mapping (MT«H^ is: 
which c^ changes from 0 to 3. 
Ibe decision tree for this example is: 



30 



7 4 20^ 
10 0 
i^O 1 0 j 
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0 ' 
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0^ 
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0 


1 -2 
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5, 



if(c(l)-0){ 

c(2) += 0; ^® matrix H^, we read that we may 
^*^**c(i) += (-!)• ^ move along w-{3,0) in the virtual space (which 
Lf (c(2) < 3) { corresponds to the vector T [0,0]=(3,0-l) in the original 
}cl5c*^{^^ ^' space). If Ci+3 ^4, then we subtract the second column of 
c(2) += (-3); i-®' (4,3)7 we find the move vector (-1,-3), and we add 
} the third column to go back in the box: (3,0)-(4, 3)+(0,5)= 
) 40 (-1,2). This corresponds in the original space to 

T[l,0H3,0,-l)-(4,3,-2)+(0,5,-l)=(-lA0). Then, for 

The decision tree approach can also be explamed fi:om a both vectors, we check the last component: in the first case, 

matrix-oriented and geometric viewpoint that has the addi- ^^^^ vector is required since the second component of 

tional advantage of simplifying the computation of the (3,0) is 0. In the second case, we may have to subtract (0,5): 

different changes in coordinates, 45 ihe last candidate is thus (-l,2)-(0,5)o(-l,-3) and 

H„ is the Hermite National Form of M. T is the basis of 

Z" such that MT-H„; the first row of MT gives the time ^ [1,1 ]=(-!, 3,1). 

difference along each column vector of T, and the last rows decision tree for this example is the same as the one 

are the coordinates of the column vectors of T in the virtual provided above. 

processor array. Since the first row of MT is (1, 0, . . . , 0), 50 ^ ^ ^ diagram summarizmg how the parallel 

^ compiler generates housekeeping code that exploits tempo- 

the first column w of T connects an isodirone to the next recurrence. 

isochrone, and the remaining colunons t 2> < • • j t „ lie in an The parallel compiler begins by generating the code of the 

isodirone. An isochrone is a set of iterations scheduled to decision tree for the local virtual processor coordinates and 

start at timet. In geometric terms, it is an (n-1) dimensional 55 the mapped out coordinate (600). As illustrated in the 

plane that contains aH iterations that start at a given time. example above, the generated code at the leaves of the 

^. , . . , _ decisiontreespecifythe value of each local virtual processor 

Given the iteratwn j , what we want to find is a vector k coordinate as a function of the virtual processor coordinate 

such that M(j +k)-(t+5t, z) where z is in the activity at a selected prior time plus a constant. The time lag is 

^ . 1 ,„ , , J ^"7* -, . . <n arbitrary, yet must be large enough so that the result of the 

table. We know already that k exists and is umque since the 60 . 1 ; * * • *• 1 ui * 

. J , ^ _^ -41 u calculation of the coordinate at the prior time is available to 

schedule starts one iteration per mitiation interval 00 each ^ . j - . • *u • •* *• a 1 

• . - . ^ - .1. f * • Ti 1. .u compute the coordinate m the current iteration. An example 

processor. This can also be seen from the fact that IL has the r ^if j * j * ^u- • *• 

^ ^ of the code generated at this point is: 

C/s on the diagonal; writing k «T >^ , wc end up with a if (cl[t-2] < 2) 

triangular system that can be easily solved thanks to the cl[t] = cl[t-2] + 3 

structure of H„. We can add a suitable linear combination of ■£ < 4) 

Ta, . . . , T„ to 8tx w (the first component of M(8tx w) does c2[t] - c2[t-2] + 3 
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k[t] » k[t-2] + 4 

c2[t] = c2[t-2] - 4 

k[t] = k[t-2] + 6 
else 5 
cl[t] = cl[t-2] - 2 
c2[t] = c2[t-2] 
k[t] o k[t-2] + 7 
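As a sanity check on the FIG. 9 example worked earlier in this section (cluster shape (4, 5), schedule (7, 4, 20), time lag 1), the following Python sketch enumerates the activity table by brute force and confirms that the three-leaf decision tree reproduces the position active one cycle later. It also maps the three candidate vectors through M. The function and variable names are illustrative assumptions, not part of the described compiler.

```python
# Sketch only: brute-force check of the FIG. 9 example (C = (4, 5), tau = (7, 4, 20), lag 1).
C1, C2 = 4, 5
GAMMA = C1 * C2              # number of virtual processors per cluster
TAU = (7, 4)                 # first two components of the schedule vector


def local_clock(c1, c2):
    """Time slot (mod GAMMA) at which cluster position (c1, c2) is active."""
    return (TAU[0] * c1 + TAU[1] * c2) % GAMMA


# Activity table: invert the local clock by enumerating every cluster position.
activity = {local_clock(c1, c2): (c1, c2) for c1 in range(C1) for c2 in range(C2)}


def step(c1, c2):
    """Decision tree from the text: cluster coordinates one cycle later."""
    if c1 == 0:
        return c1 + 3, c2 + 0
    if c2 < 3:
        return c1 - 1, c2 + 2
    return c1 - 1, c2 - 3


def check_tree():
    """True if the tree maps the position at time t to the one at t + 1 for all t."""
    return all(step(*activity[t]) == activity[(t + 1) % GAMMA] for t in range(GAMMA))


M = ((7, 4, 20), (1, 0, 0), (0, 1, 0))
DELTAS = ((3, 0, -1), (-1, 2, 0), (-1, -3, 1))   # delta[0,0], delta[1,0], delta[1,1]


def apply(matrix, vector):
    """Matrix-vector product over the integers."""
    return tuple(sum(a * b for a, b in zip(row, vector)) for row in matrix)


if __name__ == "__main__":
    print(len(activity) == GAMMA, check_tree())
    print([apply(M, d) for d in DELTAS])
```

Each candidate vector should map to a time difference of 1 in the first component, with the remaining components giving the cluster move taken at the corresponding leaf.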

In the implementation, the global iteration coordinates correspond to the global virtual processor coordinates, except for the global iteration coordinate in the mapping direction (this is the mapped out coordinate). As such, the global iteration coordinates, except for the mapped out coordinate, are computed in the same way as the global virtual processor coordinates.

The parallel compiler proceeds to generate code to compute other quantities that are linearly dependent on the local virtual processor coordinates. As it does so, it augments the decision tree code (602) with program statements that express these quantities in terms of their values at a selected previous time on the same processor element plus a constant.

The global virtual processor coordinates for the current iteration are equal to the local virtual processor coordinates plus a constant. As such, the change in the local virtual processor coordinate between the current and selected previous time is the same as the change in the global virtual processor coordinate. Furthermore, the remaining global iteration space coordinates are identical to the global virtual processor coordinates. The parallel compiler generates the program statements to compute the remaining global iteration coordinates for the current iteration in the same way as the corresponding local virtual processor coordinates (604).

Continuing the example, the decision tree code now looks like:

    if (c1[t-2] < 2)
        c1[t] = c1[t-2] + 3
        i[t] = i[t-2] + 3
        if (c2[t-2] < 4)
            c2[t] = c2[t-2] + 3
            j[t] = j[t-2] + 3
            k[t] = k[t-2] + 4
        else
            c2[t] = c2[t-2] - 4
            j[t] = j[t-2] - 4
            k[t] = k[t-2] + 6
    else
        c1[t] = c1[t-2] - 2
        c2[t] = c2[t-2]
        i[t] = i[t-2] - 2
        j[t] = j[t-2]
        k[t] = k[t-2] + 7

Next, the parallel compiler generates the code to compute memory addresses of array elements in the loop body (606). Based on the linear mapping of iterations to physical processors, the memory address of an array element, e.g., a[i+j-3*k+2], has the form addr_of_element = Ci*i + Cj*j + Ck*k + C0. At this point, the parallel compiler has computed the constants C0, Ci, Cj, and Ck. Thus, it can now compute a change in a memory address from the changes in i, j, and k. The resulting decision tree code looks like:
    if (c1[t-2] < 2)
        c1[t] = c1[t-2] + 3
        i[t] = i[t-2] + 3
        if (c2[t-2] < 4)
            c2[t] = c2[t-2] + 3
            j[t] = j[t-2] + 3
            k[t] = k[t-2] + 4
            addr_of_element[t] = addr_of_element[t-2] + Ci*3 + Cj*3 + Ck*4
        else
            c2[t] = c2[t-2] - 4
            j[t] = j[t-2] - 4
            k[t] = k[t-2] + 6
            addr_of_element[t] = addr_of_element[t-2] + Ci*3 + Cj*(-4) + Ck*6
    else
        c1[t] = c1[t-2] - 2
        c2[t] = c2[t-2]
        i[t] = i[t-2] - 2
        j[t] = j[t-2]
        k[t] = k[t-2] + 7
        addr_of_element[t] = addr_of_element[t-2] + Ci*(-2) + Cj*(0) + Ck*7

The constant expressions can be evaluated and simplified to a single constant at compile time or during initialization in the host processor.
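To make the constant folding concrete, here is a minimal Python sketch that collapses each leaf's address expression into one constant. It assumes the coefficients for a[i+j-3*k+2] are simply Ci = 1, Cj = 1, Ck = -3 in units of array elements; an actual compiler would also account for element size and the array base address, which are left out here. The leaf labels and the helper name are hypothetical.

```python
# Sketch only: fold the per-leaf address increments into single constants.
# Assumed coefficients for a[i + j - 3*k + 2], counted in array elements.
Ci, Cj, Ck = 1, 1, -3

# (delta_i, delta_j, delta_k) at each leaf of the decision tree in the text.
LEAF_CHANGES = {
    "c1 < 2, c2 < 4": (3, 3, 4),
    "c1 < 2, c2 >= 4": (3, -4, 6),
    "c1 >= 2": (-2, 0, 7),
}


def fold_address_deltas(ci, cj, ck, leaves):
    """Return one precomputed address increment per decision-tree leaf."""
    return {
        name: ci * di + cj * dj + ck * dk
        for name, (di, dj, dk) in leaves.items()
    }


ADDR_DELTAS = fold_address_deltas(Ci, Cj, Ck, LEAF_CHANGES)

if __name__ == "__main__":
    for leaf, delta in ADDR_DELTAS.items():
        print(f"{leaf}: addr[t] = addr[t-2] + ({delta})")
```

With the folded constants, each leaf performs a single addition to update the address, regardless of how many loop indices the array subscript involves.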

The local storage for the decision tree approach can be 
implemented efficiently using a FIFO buffer of depth equal 
to the selected time lag. Initial values for the FIFO buffer 
may be computed and stored in local or global storage at 
compile time or during initialization in the host processor. 
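The FIFO arrangement can be sketched in Python as follows, using the lag-2 decision tree from the listings above. The two initial states are hypothetical values chosen only to make the sketch runnable; in practice they would come from the schedule and would be preloaded at compile time or during initialization, as the text describes.

```python
from collections import deque

LAG = 2  # selected time lag; also the FIFO depth


def leaf_update(c1, c2, k):
    """Lag-2 decision tree from the text: state at t from the state at t - LAG."""
    if c1 < 2:
        if c2 < 4:
            return c1 + 3, c2 + 3, k + 4
        return c1 + 3, c2 - 4, k + 6
    return c1 - 2, c2, k + 7


def run(initial_states, cycles):
    """Advance the recurrence; initial_states holds the LAG oldest-first states."""
    fifo = deque(initial_states, maxlen=LAG)
    trace = []
    for _ in range(cycles):
        new_state = leaf_update(*fifo[0])  # fifo[0] is the state from t - LAG
        fifo.append(new_state)             # maxlen drops the oldest entry
        trace.append(new_state)
    return trace


# Hypothetical initial (c1, c2, k) values, for illustration only.
TRACE = run([(0, 0, 0), (2, 5, 1)], cycles=6)

if __name__ == "__main__":
    for state in TRACE:
        print(state)
```

The buffer never holds more than LAG entries, which is the storage advantage over a circular buffer whose depth equals the cluster size.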

Next, the parallel compiler generates the code to compute 
boundary predicates (608). Most predicates, except those 
dependent on the mapped out coordinate, are periodic with 
a period equal to the cluster size. Iteration space predicates 
test whether an iteration is within the bounds of the iteration 
space. They are periodic, except for those dependent on the 
mapped out coordinate. 

The tile edge predicate is used for load/store control, and 
in particular, to propagate read-only arrays. These predicates 
are periodic unless dependent on the mapped out coordinate. 

Cluster edge predicates indicate whether an iteration is at 
a cluster boundary, and as such, indicate whether to select 
local or remote data reference. 

The local storage for the circular buffer approach can be 
implemented efficiently using a circular buffer of depth 
equal to the size of the cluster. Values for the circular buffer 
may be computed and stored in local or global storage at 
compile time or during initialization in the host processor. 
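The circular buffer approach for periodic predicates can be sketched along the same lines. The Python example below reuses the FIG. 9 cluster (shape (4, 5), period 20) and a hypothetical cluster edge predicate that is true on the outer ring of the cluster; the class name and the predicate definition are assumptions made for illustration.

```python
# Sketch only: periodic predicate served from a circular buffer of depth = cluster size.
C1, C2 = 4, 5
PERIOD = C1 * C2
TAU = (7, 4)


def cluster_position(t):
    """Cluster position active at time t, found by brute-force search."""
    for c1 in range(C1):
        for c2 in range(C2):
            if (TAU[0] * c1 + TAU[1] * c2) % PERIOD == t % PERIOD:
                return c1, c2
    raise ValueError("no active position; schedule is not tight")


def at_cluster_edge(c1, c2):
    """Hypothetical cluster edge predicate: true on the boundary of the cluster."""
    return c1 in (0, C1 - 1) or c2 in (0, C2 - 1)


class CircularPredicateBuffer:
    """Holds one period of precomputed values and replays them cyclically."""

    def __init__(self, values):
        self.values = list(values)
        self.head = 0

    def next(self):
        value = self.values[self.head]
        self.head = (self.head + 1) % len(self.values)
        return value


# One period of predicate values, computed once before the loop starts.
TABLE = [at_cluster_edge(*cluster_position(t)) for t in range(PERIOD)]

if __name__ == "__main__":
    buffer = CircularPredicateBuffer(TABLE)
    print([buffer.next() for _ in range(PERIOD)])
```

Because the buffer depth equals the cluster size, this approach costs more storage than the decision tree for large clusters, which matches the drawback noted earlier in the text.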

4.3.6 Assembly Code Generation 

Assembly code generation is an implementation specific 
process for transforming the optimized loop nest code from 
the form used in the parallel compiler to the form used in the 
synthesis of the processor array. 

The parallel compiler uses a machine-independent representation for programs that is well suited for loop transformations and other high level parallelizing transformations. In particular, the parallel compiler employs routines and the machine independent representation of the SUIF compiler, a publicly available compiler from Stanford University.

The synthesis process takes a machine dependent control 
flow graph (CFG) representation of the program as input. 
The specific representation used in the implementation is 
publicly available as part of the TRIMARAN software 
available from New York University. 

In the process of converting to machine dependent CFG form, this phase translates the high level representation to operations for a parameterized family of processors, called HPL-PD, which is described in Vinod Kathail, Michael Schlansker, B. Ramakrishna Rau, HPL PlayDoh Architecture Specification: Version 1.0, Technical Report HPL-93-80, Hewlett-Packard Laboratories, February, 1994.

This phase produces two files: one contains the machine 
dependent CFG in textual form, and the other is the anno- 
tation file. 

The annotation file provides a mapping of uniformized arrays to expanded virtual registers (EVRs) along with data dependences expressed as time/space distances for each variable reference. These dependences identify inter-processor data communication within the processor array. The annotation file also specifies global variables that remain as global variables in the transformed loop nest. In addition, the annotation file specifies the mapping of non-uniformized, non-global variables to local storage elements (e.g., EVRs, and static registers).

The synchronous processor array is a co-processor of a host VLIW processor. The nested loop is part of a larger program executed on the host. To execute the loop nest, the host VLIW communicates commands and data to the synchronous processor array through an interface. In the interface with the processor array, the host VLIW processor views live-in and live-out variables as residing in its local memory. The annotation file identifies live-in/live-out variables and assigns memory addresses in the host processor's local memory space to them. In the current implementation, a code translator transforms the optimized code from the parallel compiler to the code format used in the synthesis process and generates the annotation file. The parallel compiler uses a predefined naming convention for identifying uniformized arrays, static variables, and non-uniformized/non-global variables. These naming conventions allow the translator to identify the variable types and generate the entries in the annotation file.

The data flow analysis phase of the front end identifies the live-in and live-out variables. These are the variables that the host processor initializes before the processor array executes the loop nest (the live-in variables), and the variables that the host processor may query when the processor array has completed execution of the loop nest (the live-out variables).

The information either from the source program or from front end computations is kept in the internal representation. At the time of assembly code generation, the information is written out to a file in a simple text format.
5.0 Conclusion 

Although the preceding sections describe specific implementations of a parallel compiler, the invention is not limited to these implementations. Components of the system may be used in various combinations for different computer architectures and hardware design scenarios. In addition, the methods apply to a variety of parallel processor architectures and are not limited to synchronous processor arrays or co-processors that operate in conjunction with a host processor.

The parallel compiler methods for transforming a sequential nested loop into a parallel program, for example, apply to a variety of parallel computing architectures, including both synchronous and asynchronous processor systems. While it is particularly advantageous to generate a parallel program for synthesis into an application specific design, it is also possible to use the parallel compiler techniques for an existing or given multi-processor architecture.

Some implementations may omit or use alternatives to the processes used in the implementation of the parallel compiler. For example, developers of alternative implementations may choose to omit tiling, may use alternative scheduling methodologies, or may omit or use alternative loop transformations and control code schemes and their optimizations.

The methods and system components may be used to design multiple processor arrays. In some applications, for example, the input program may include several loop nests. These loop nests may be extracted, transformed to parallel processes, and synthesized into distinct processor arrays or cast onto a single multi-function processor array. The methodology and system design described enables the use of parallel compiler and ILP compiler technology to optimize the parallel code and the hardware synthesized from this parallel code. Examples of the parallel compiler technologies include tiling, uniformization or privatization, iteration mapping, clustering, unimodular loop transformation and non-unimodular loop transformation.

In view of the many possible implementations of the 
invention, it should be recognized that the implementation 
described above is only an example of the invention and 
should not be taken as a limitation on the scope of the 
invention. Rather, the scope of the invention is defined by 
the following claims. We therefore claim as our invention all 
that comes within the scope and spirit of these claims. 

We claim: 

1. In a process of transforming a nested loop having an
iteration space defined by loop indices into a single loop for
execution on each processor element in an array of parallel
processors, a method for optimizing code in the single loop
comprising:

obtaining a mapping of iterations of the nested loop to 
processor elements in the array and a schedule of start 
times for initiating execution of the iterations on cor- 
responding processor elements; and 

from the mapping of iterations and the schedule of start 
times, generating code to compute iteration coordinates 
on a processor element for an iteration of the single 
loop based on values of the iteration coordinates for a 
previous iteration of the single loop on the same 
processor element. 

2. The method of claim 1 wherein: 

iterations are mapped to a virtual processor-null direction
space, where virtual processors in the virtual processor
space each have a corresponding set of iterations,

virtual processors are mapped to processor elements such 
that a cluster of virtual processors is assigned to each 
processor element, and the iteration coordinates com- 
prise local coordinates of a virtual processor in a 
specified cluster. 

3. The method of claim 1 including: 

generating code to compute a quantity that is linearly 
dependent on the iteration coordinates using a corre- 
sponding quantity computed on the same processor 
element for a previous iteration of the single loop. 

4. The method of claim 3 wherein iterations in the 
iteration space are mapped to a virtual processor, where each 
virtual processor is assigned to a set of iterations, and each 
iteration in the set is assigned a start time of execution, and 
the quantity comprises a virtual processor coordinate in the 
virtual processor space. 

5. The method of claim 1 including:

generating code to compute a value of a predicate used to
evaluate a loop boundary condition on a processor
element for an iteration of the single loop from a
previous value of the predicate computed on the same
processor element for a previous iteration of the single
loop.

6. The method of claim 5 wherein the loop boundary
condition includes a test indicating whether the iteration
coordinates are within the iteration space to determine
whether there is an iteration scheduled for a processor
element at a specified time.

7. The method of claim 5 wherein the iteration space is 
partitioned into tiles of iterations that are initiated sequen- 
tially and the loop boundary condition includes a test 
indicating whether an iteration is at a tile boundary. 

8. The method of claim 5 wherein:

iterations are mapped to a virtual processor-null direction
space, where virtual processors in the virtual processor
space each have a corresponding set of iterations,

virtual processors are mapped to processor elements such
that a cluster of virtual processors is assigned to each
processor element, and the loop condition includes a
test indicating whether an iteration is at a cluster
boundary, the cluster boundary being defined as iterations
at an edge of the cluster shape, and the cluster
shape being defined by the mapping of virtual processors
to the processor elements.

9. The method of claim 1 including: 

generating code to compute a memory address for an
array element in an operation within the loop from a
previous value of the memory address computed on the
same processor element for a previous iteration of the
single loop.

10. The method of claim 1 wherein the schedule provides 
start times for initiating execution of the iterations on a 
processor element such that no more than one iteration is 
started on a processor element for each initiation interval. 

11. The method of claim 1 wherein: 

iterations in the iteration space are mapped to a virtual
processor space, where each virtual processor is
assigned to a set of iterations, and each iteration associated
with a virtual processor is assigned a time of
initiation,

a cluster of virtual processors is mapped to each processor
element, and the frequency with which an iteration
associated with a virtual processor in the cluster is
initiated on the physical processor of the cluster is
periodic;

the method further including:

generating code to buffer data computed for an iteration so
that the data is propagated to a subsequent iteration on
the processor for use in calculating the iteration coordinates
or values that are linearly dependent on the
iteration coordinates.

12. The method of claim 1 including: 

generating code representing a decision tree that imple- 
ments the computation of the iteration coordinates from 
values of the iteration coordinates on the same proces- 
sor at an earlier time; 

wherein the decision tree is a binary tree, the binary tree
has a depth equal to a number of dimensions of a cluster
having a dimension more than one, a test at each
internal node of the decision tree compares one cluster
coordinate to a constant, and the leaves of the tree
specify for a current iteration a change in iteration
coordinates relative to previous iteration coordinates
calculated in the same processor.

13. The method of claim 12 wherein the changes in
iteration coordinates specified in the leaves of the tree are
used to compute linearly related quantities to the iteration
coordinates using data propagated from a previous iteration
on the processor element.

14. The method of claim 13 wherein the linearly related 
quantities include array indices of a variable in the loop 
body. 

15. The method of claim 13 wherein the linearly related 
quantities include memory addresses of a variable stored 
external to local memory of the processor element. 

16. The method of claim 1 wherein: 

iterations in the iteration space are mapped to a virtual 
processor space, where each virtual processor is 
assigned to a set of iterations, and each iteration asso- 
ciated with a virtual processor is assigned a time of 
initiation, 

a cluster of virtual processors is mapped to each processor 
element, and the frequency with which an iteration 
associated with a virtual processor in the cluster is 
initiated on the physical processor of the cluster is 
periodic; 

the method further including: 

generating code to buffer data that is periodic so that
a periodic quantity is propagated to a subsequent
iteration for re-use in the subsequent iteration.

17. The method of claim 16 wherein the periodic quantity
is a boolean value representing a test of a loop boundary
condition.

18. The method of claim 16 wherein the periodic quantity 
is an iteration coordinate. 

19. A computer readable medium on which is stored 
software for performing the method of claim 1. 

20. In a parallel compiler for transforming a nested loop 
having an iteration space defined by loop indices into a 
single loop for execution on each processor element in an 
array of parallel processors, a compiler system for optimiz- 
ing code in the single loop comprising: 

means for accessing a data structure representing the 
mapping of iterations of the nested loop to processor 
elements in the array and a data structure representing 
a schedule of start times for initiating execution of the 
iterations on corresponding processor elements; and 

means for generating code to compute iteration coordinates
on a processor element for an iteration of the
single loop from iteration coordinates computed on the
same processor element for a previous iteration of the
single loop from the mapping of iterations and the
schedule of start times.